【Day 09】- BeautifulSoup, Everyone's Favorite

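2021 iThome Ironman Contest

DAY 9

AI & Data

Web Crawlers, Crawl Everything - Understand and Practice Web Scraping and Anti-Scraping Countermeasures in 30 Days, Part 9

13th Ironman Contest

Vincent55

Team: 肝已經，死了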

2021-09-24 20:38:28

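Previously

The previous article walked through the Requests-HTML library and used it for data cleaning, so that the data we actually want can be pulled out of a pile of raw content.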

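Before We Start

The Requests library has no data-cleaning features of its own and needs another tool to help. Today we introduce the well-known BeautifulSoup package.

BeautifulSoup is a Python library that extracts data from HTML or XML files. It can also repair malformed documents, such as ones with unclosed tags.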
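
As a small illustration of the repair behaviour described above, here is a minimal sketch. The snippet `broken_html` is a made-up example, and it uses the standard library's html.parser (rather than the html5lib parser chosen later in this article) so that it runs without any extra install.

```python
from bs4 import BeautifulSoup

# Hypothetical malformed snippet: neither <p> nor <b> is ever closed.
broken_html = "<p>Hello<b>world"

# html.parser ships with Python, so this sketch needs no extra parser package.
soup = BeautifulSoup(broken_html, "html.parser")

# Serializing the tree back to text shows the closing tags that were added.
repaired = str(soup)
print(repaired)
```

The exact repaired markup depends on the parser, which is why the parser choice discussed next matters.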

Installing BeautifulSoup and a Parser

Install BeautifulSoup with the following command.

pipenv install beautifulsoup4

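BeautifulSoup also needs a parser to preprocess the document before it can analyze it. The standard library ships with html.parser, but this article uses html5lib as the parser. html5lib is more tolerant of malformed markup but slower, so readers who care more about speed can use a different parser.

※ Note: because parsers differ in how they tolerate malformed markup, the tree a parser produces may differ from the actual document. If you cannot find the element you are looking for, consider switching parsers. html5lib is the most tolerant parser.

Install the html5lib parser with the following command.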

pipenv install html5lib
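
To make the note about parser differences concrete, the sketch below feeds the same invalid markup to two parsers and prints each resulting tree. The helper `parse_with` and the sample string `markup` are illustrative names, not part of BeautifulSoup. A parser that is not installed is reported instead of crashing the script.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with(markup, parser):
    """Return the serialized tree, or None if the parser is not installed."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        return None


# Invalid markup: an <a> that is never closed, followed by a stray </p>.
markup = "<a></p>"

results = {name: parse_with(markup, name) for name in ("html.parser", "html5lib")}
for name, tree in results.items():
    print(f"{name:12} -> {tree if tree is not None else 'not installed'}")
```

html.parser simply drops the stray closing tag, while html5lib builds a full document with html, head, and body elements the way a browser would, so a selector that works with one parser may not match with the other.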

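Using BeautifulSoup

First, let's parse an HTML page. The HTML in this example comes from https://ithelp.ithome.com.tw/users/20134430/ironman/4307 and is fetched with the Requests package. Readers who have not seen how to use Requests can read 【Day 07】- 第一隻網路爬蟲要用什麼函式庫? (Requests) (Which library should your first web crawler use?).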

import requests
from bs4 import BeautifulSoup

url = 'https://ithelp.ithome.com.tw/users/20134430/ironman/4307'
# Send a GET request to url and store the response object in resp
resp = requests.get(url)
# Pass resp.text (the HTML) to a BeautifulSoup object and parse it with html5lib
soup = BeautifulSoup(resp.text, 'html5lib')

# Print the page title
print(soup.title.get_text())

# Print the text of the first <li> element found
print(soup.li.get_text())

# Print the text of the first <li> element found (same effect)
print(soup.find('li').get_text())

# Print the text of every <li> element
lis = soup.find_all('li')
for li in lis:
    print(li.get_text())

Getting Tag Attributes

To read an attribute of a tag, treat the tag like a dictionary.

For example, given the tag <a href='https://www.google.com'>OwO</a>, soup.a['href'] returns the attribute value https://www.google.com

import requests
from bs4 import BeautifulSoup

url = 'https://ithelp.ithome.com.tw/users/20134430/ironman/4307'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html5lib')

links = soup.find_all('a')
for link in links:
    if 'href' in link.attrs:
        print(link['href'])
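
The `'href' in link.attrs` check above can also be written in a couple of other ways. The following is a minimal offline sketch on a made-up HTML string (the second `<a>` tag is hypothetical and deliberately has no href), parsed with html.parser so it runs without extra packages.

```python
from bs4 import BeautifulSoup

# Made-up markup: one link with an href, one anchor without.
html = """
<a href="https://www.google.com">OwO</a>
<a name="top">no href here</a>
"""
soup = BeautifulSoup(html, "html.parser")

first, second = soup.find_all("a")

# Dictionary-style access raises KeyError when the attribute is missing...
print(first["href"])

# ...whereas .get() returns None instead, like dict.get().
print(second.get("href"))

# href=True asks find_all for only the <a> tags that have an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)
```

Filtering with `href=True` keeps the loop body free of attribute checks, which may read more cleanly when only the link targets are needed.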

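Locating Elements with BeautifulSoup

soup.find() : returns the "first" element that matches the conditions as a Tag object, or None if nothing matches.
soup.find_all() : returns "all" elements that match the conditions as a list, or an empty list if nothing matches.
soup.select() : returns all elements that match a CSS selector as a list.

Elements can be located by tag name, id, or class, for example soup.find('p', id='myid', class_='myclass'). Note the underscore after class, which is required to avoid a clash with the Python keyword class.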
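
The three lookup methods can be compared side by side on a small document. This is a sketch on made-up HTML (the ids, classes, and text are illustrative), parsed with html.parser so it runs offline.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
html = """
<div>
  <p id="myid" class="myclass">first</p>
  <p class="myclass">second</p>
  <p>third</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find(): the first match as a Tag, or None when nothing matches.
first = soup.find("p", id="myid", class_="myclass")
missing = soup.find("span")

# find_all(): every match, as a list (empty when nothing matches).
all_myclass = soup.find_all("p", class_="myclass")

# select(): the same idea, expressed as a CSS selector.
selected = soup.select("div > p.myclass")

print(first.get_text(), missing)
print([p.get_text() for p in all_myclass])
print([p.get_text() for p in selected])
```

Because find() can return None, it is worth checking the result before calling methods on it when the page structure is not guaranteed.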

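Example

Suppose we want to scrape the URL of the article 【Day 01】- 前言: 從 0 開始的網路爬蟲 (Preface: Web Scraping from Zero). First, pick the element with the browser's element selection tool. It turns out to be <a href="https://ithelp.ithome.com.tw/articles/10263628" class="qa-list__title-link">【Day 01】- 前言: 從 0 開始的網路爬蟲</a>, so we can start locating this element.

First, observe that the tag is an <a> and that it has a class named qa-list__title-link. This example is very easy; readers can refer to the code below.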

import requests
from bs4 import BeautifulSoup

url = 'https://ithelp.ithome.com.tw/users/20134430/ironman/4307'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html5lib')

link = soup.find('a', class_='qa-list__title-link')
print(link['href'].strip())
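
The same lookup can be expressed with a CSS selector and guarded against the element being absent. The sketch below is an offline variant under assumptions: `first_article_url` is a hypothetical helper, and `sample` is a hand-written copy of the tag quoted above (with extra whitespace added around the URL to show why strip() is useful), not the live page.

```python
from bs4 import BeautifulSoup


def first_article_url(html):
    """Return the href of the first article title link, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # select_one() returns the first element matching the CSS selector, or None.
    link = soup.select_one("a.qa-list__title-link")
    if link is None or not link.get("href"):
        return None
    return link["href"].strip()


# Hand-written sample based on the tag quoted in the article.
sample = (
    '<h3><a href=" https://ithelp.ithome.com.tw/articles/10263628 " '
    'class="qa-list__title-link">【Day 01】- 前言: 從 0 開始的網路爬蟲</a></h3>'
)

print(first_article_url(sample))
print(first_article_url("<p>no links here</p>"))
```

Returning None instead of raising makes the helper easier to reuse on pages where the layout changes or the class name is missing.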